BitNet b1.58 training #930
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/930
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 77d5c0d with merge base 68e1886.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
elif args.quantize == "int8_mixed_precision": | ||
quantize_(model.layers, int8_mixed_precision_training(), set_inductor_config=False) | ||
|
||
elif args.quantize == "bitnet": |
optional: this is "change model architecture and then do quantization", which is pretty different from just "quantization". For code clarity, maybe we can either have an explicit preprocessing step to be called separately, or call the arg something like rmsnorm_model_surgery_then_quantize_bitnet?
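Purely to illustrate the split being suggested here, a hypothetical two-step flow might look like this (the helper name is made up for the sketch and is not the PR's API):

```python
from torch import nn

def prepare_bitnet_model(model: nn.Module) -> nn.Module:
    # hypothetical surgery step: remove the original pre-attention / pre-MLP
    # RMSNorm and insert an RMSNorm before every nn.Linear, as BitNet expects
    ...
    return model

# elif args.quantize == "bitnet":
#     model = prepare_bitnet_model(model)  # explicit preprocessing step
#     quantize_(model.layers, bitnet_training(), set_inductor_config=False)
```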
def _pack_i2_in_i8(x: Tensor):
    # NOTE: this is signed integer, so we have to mask before bit-shift
    return (x[:, ::4] << 6) | ((x[:, 1::4] & 0b11) << 4) | ((x[:, 2::4] & 0b11) << 2) | (x[:, 3::4] & 0b11)
readability nit: write it out line by line with comments to make it easier to understand for code readers?
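For illustration, an unrolled version with per-line comments might look something like this (an untested sketch that assumes the same semantics as the one-liner above):

```python
def _pack_i2_in_i8(x: Tensor) -> Tensor:
    # x holds signed 2-bit values stored in int8; pack 4 of them into each output byte.
    # Mask with 0b11 before shifting because the values are signed: without the mask,
    # the sign-extended high bits would overwrite the neighbouring 2-bit slots.
    bits_76 = x[:, 0::4] << 6           # columns 0, 4, 8, ...  -> bits 7-6 (the shift discards the high bits, so no mask)
    bits_54 = (x[:, 1::4] & 0b11) << 4  # columns 1, 5, 9, ...  -> bits 5-4
    bits_32 = (x[:, 2::4] & 0b11) << 2  # columns 2, 6, 10, ... -> bits 3-2
    bits_10 = x[:, 3::4] & 0b11         # columns 3, 7, 11, ... -> bits 1-0
    return bits_76 | bits_54 | bits_32 | bits_10
```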
nice!
LGTM for prototype but feel free to wait for other reviews if needed
Excellent work as usual! Feel free to merge this whenever you're ready @gau-nernst
* first upstream of BitNet
* fix type annotation
* skip bitnet test on cpu. add bitnet to benchmark script
* add bitnet option to example training script. update backward
* add FSDP2 test
* remove FSDP2 mixed-precision workaround. cleanup test
* fix typo
* adjust tolerance
* update command
* add precompute scale for FSDP2
* fix typing
* add test for precompute scale
* rename
* separate BitNet model surgery
* minor fixes. add note on packing
This PR adds training code for BitNet b1.58 (ternary weights, i.e. 1.58 bits; the first version of BitNet used binary weights). It is implemented as a tensor subclass and integrates nicely with the quantize_() API. I also added 2 extra optimizations.
Not optimized for inference (yet). A good baseline for inference would be something like the A8W2 kernel from GemLite.
BitNet b1.58
BitNet b1.58 uses ternary weights: each parameter can only take on 3 distinct values {-1, 0, +1}, thus making a BitNet model very compact. BitNet uses tensor-wise abs-mean scaling for weights (quantize to ternary) and row-wise abs-max scaling for activations (quantize to INT8).
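For intuition, a minimal sketch of these two quantizers could look like the following (illustrative only; the function names and the epsilon clamp are my own, not the PR's code):

```python
import torch

def ternary_quantize_weight(w: torch.Tensor):
    # tensor-wise abs-mean scale, then round and clamp to {-1, 0, +1}
    scale = w.abs().mean().clamp(min=1e-5)
    w_ternary = (w / scale).round().clamp(-1, 1).to(torch.int8)
    return w_ternary, scale

def int8_quantize_activation(x: torch.Tensor):
    # row-wise abs-max scale, then round and clamp to [-128, 127]
    scale = (x.abs().amax(dim=-1, keepdim=True) / 127).clamp(min=1e-5)
    x_int8 = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return x_int8, scale
```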
BitNet is originally trained with QAT: the weights and activations are fake-quantized, and straight-through estimator (STE) is used to calculate gradients with respect to floating point weights. This process adds extra overhead over standard training. Our implementation utilizes INT8 Tensor Cores to make up for this loss in speed. In fact, our implementation is faster than BF16 training in most cases.
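Conceptually, the STE part can be written with the usual detach trick. The sketch below builds on the illustrative quantizers above and is not the tensor-subclass dispatch path that actually targets the INT8 tensor cores:

```python
import torch.nn.functional as F

def bitnet_linear_fake_quant(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # fake-quantize weight and activation, then matmul in the original dtype
    w_q, w_scale = ternary_quantize_weight(w)
    x_q, x_scale = int8_quantize_activation(x)
    w_dq = w_q.to(w.dtype) * w_scale
    x_dq = x_q.to(x.dtype) * x_scale
    # y + (y_dq - y).detach() equals y_dq in the forward pass, but its gradient
    # w.r.t. y is the identity: the straight-through estimator
    w_ste = w + (w_dq - w).detach()
    x_ste = x + (x_dq - x).detach()
    return F.linear(x_ste, w_ste)
```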
Usage
Note: following the BitNet Training Tips, Code and FAQ, users should insert an extra RMSNorm before each nn.Linear layer and remove the original RMSNorm before the attention and MLP modules. Calling quantize_(model, bitnet_training()) will NOT perform this for you. Take a look at our example training script benchmarks/quantized_training/pretrain_llama2.py to see how to do this for our Llama model; a minimal usage sketch also follows below.
When used with FSDP2 training, you can pre-compute BitNet weight scales for the next iteration so that all scales are synchronized with a single all-reduce operation. This should be done after the optimizer step.
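A minimal usage sketch (the import path of bitnet_training and the FSDP2 helper are my reading of this PR, so double-check them against the prototype module):

```python
import torch
from torch import nn
from torchao.quantization import quantize_
from torchao.prototype.quantized_training import bitnet_training  # path assumed

# toy stand-in; in practice this is your transformer with the extra RMSNorm
# layers already inserted before every nn.Linear, per the note above
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 256))
model = model.cuda().bfloat16()

quantize_(model, bitnet_training(), set_inductor_config=False)

x = torch.randn(8, 256, device="cuda", dtype=torch.bfloat16)
model(x).sum().backward()  # gradients flow to the BF16 weights via the STE

# With FSDP2, pre-compute the BitNet weight scales for the next iteration right
# after optimizer.step() so all scales are synchronized with a single all-reduce
# (see the "add precompute scale for FSDP2" commit for the exact helper).
```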
Results
Convergence check
Ran with my experimental repo https://github.com/gau-nernst/quantized-training. Llama2-1.1B (based on TinyLlama) trained for 1B tokens on FineWeb-Edu using 1x 4090. Baseline is full BF16 training. Each step is 4x8192 tokens.
Note: at this scale, we don't expect the loss curves to have a gap. According to Figure 2 of the FAQ, the gap only appears at around 10B tokens.
Sanity benchmark with built-in training script
Using benchmarks/quantized_training/pretrain_llama2.py. Llama2-1B on TinyStories, w/ 4090, 1k steps. Each step is 16x2048 tokens. PyTorch 2.4.0. Baseline is full BF16 training.
The train loss is a bit strange, but I think training on TinyStories is not so reliable. Perhaps it's just numerics. Side note: the speedup is impressive because INT8 tensor cores are very fast on the 4090 (up to 3.5x faster than BF16 tensor cores).
FSDP2 benchmark w/ torchtitan
Using https://github.com/gau-nernst/torchtitan/tree/bitnet. Llama3-8B on C4, default config, 4x A100.
torch==2.6.0.dev20240924
Note: due to the way torchtitan initializes weights, it's a bit troublesome to add the extra RMSNorm layers as recommended by the paper. Thus, for the benchmarks in torchtitan, I don't add the extra RMSNorm.